Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
We will go through this notebook to:
In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat"
# load the CSV; the file has no header row
airports = pd.read_csv(url,header=None)
print("Check DataFrame types")
display(airports.dtypes)
Here you can find an explanation of each variable:
In [3]:
import numpy as np
print("-> Original DF")
display(airports.head())
#we can add a name to each variable
h = ["airport_id","name","city","country","IATA","ICAO","lat","lon","alt","tz","DST","tz_db"]
airports = airports.iloc[:,:12]
airports.columns = h
print("-> Original DF with proper names")
display(airports.head())
print("-> With the proper names it is easier to check correctness")
display(airports.dtypes)
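The slicing and renaming steps above can also be folded into the read itself. The following is a minimal sketch, assuming the file keeps its 14-column layout and marks missing values with a literal \N; the two rows are made up for illustration and are not taken from the real data.

```python
import io
import pandas as pd

# Two made-up rows in the same 14-column layout as airports.dat (illustrative only).
raw = (
    '1,"Alpha Field","Alphaville","Freedonia","AAA","XAAA",10.5,20.25,1000,1,"E","Europe/Madrid","airport","Example"\n'
    '2,"Beta Strip","Betatown","Freedonia",\\N,"XBBB",-33.0,151.0,50,\\N,\\N,\\N,"airport","Example"\n'
)
names = ["airport_id", "name", "city", "country", "IATA", "ICAO",
         "lat", "lon", "alt", "tz", "DST", "tz_db"]

sample = pd.read_csv(
    io.StringIO(raw),
    header=None,
    usecols=list(range(12)),  # keep only the first 12 columns
    names=names,              # name them at load time
    na_values=["\\N"],        # treat the literal \N marker as missing
)
print(sample.isnull().sum())
```

With na_values set this way, the \N markers show up in isnull() counts instead of hiding as ordinary strings.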
Convert the altitude from feet to metres (1 ft = 0.3048 m).
In [4]:
airports.alt.describe()
Out[4]:
In [5]:
airports.alt = airports.alt * 0.3048
In [6]:
airports.dtypes
Out[6]:
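Overwriting a column in place means that re-running the cell converts the values a second time. A small sketch of one way to avoid that, writing the result to a separate column; the names alt_ft and alt_m and the altitudes are hypothetical.

```python
import pandas as pd

FT_TO_M = 0.3048  # one international foot is exactly 0.3048 m

# Made-up altitudes in feet, for illustration only.
demo = pd.DataFrame({"alt_ft": [0, 1000, 5282]})

# Writing to a new column keeps the source values intact,
# so running this line twice gives the same result.
demo["alt_m"] = demo["alt_ft"] * FT_TO_M
print(demo)
```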
Check whether we have NaNs.
In [7]:
airports.isnull().sum(axis=0)
Out[7]:
In [8]:
# we can create a new label which corresponds to not having data
airports.IATA = airports.IATA.fillna("Blank")
airports.ICAO = airports.ICAO.fillna("Blank")
In [9]:
airports.isnull().sum(axis=0)
Out[9]:
Let's check errors.
In [10]:
# a latitude is invalid if it is above 90 OR below -90
((airports.lat > 90) | (airports.lat < -90)).any()
Out[10]:
In [11]:
# a longitude is invalid if it is above 180 OR below -180
((airports.lon > 180) | (airports.lon < -180)).any()
Out[11]:
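A range check of this kind can also be written with Series.between, which is inclusive on both ends by default. A minimal sketch on made-up coordinates:

```python
import pandas as pd

# Made-up coordinates; rows 1, 2 and 3 each contain one impossible value.
coords = pd.DataFrame({
    "lat": [40.4, -91.0, 12.0, 95.5],
    "lon": [-3.7, 10.0, 181.0, 0.0],
})

# between() returns False for NaN, so negating it also flags missing values.
bad_lat = ~coords["lat"].between(-90, 90)
bad_lon = ~coords["lon"].between(-180, 180)

print(coords[bad_lat | bad_lon])
```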
We can check for outliers in the altitude.
In [12]:
airports.alt.describe()
Out[12]:
Let's explore the 5th and 95th percentiles.
In [13]:
qtls = airports.alt.quantile([.05,.5,.95],interpolation="higher")
qtls
Out[13]:
In [14]:
# check how many of them are below the median
(airports.alt <= qtls[0.5]).sum()
Out[14]:
In [15]:
# check how many of them are at or above the median
(airports.alt >= qtls[0.5]).sum()
Out[15]:
In [16]:
# check how many of them are at or below the 5th percentile
(airports.alt <= qtls[0.05]).sum()
Out[16]:
In [17]:
# check how many of them are at or above the 95th percentile
(airports.alt >= qtls[0.95]).sum()
Out[17]:
In [18]:
airports.shape[0]*.05
Out[18]:
In [19]:
print("-> Check which airports are below the 5th percentile")
display(airports[(airports.alt < qtls[0.05])].head(10))
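The percentile cut above is one convention for flagging outliers. Another common one is the box-and-whiskers rule that the exercises below refer to: a value is flagged when it lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. A minimal sketch on made-up numbers, not on the airports data:

```python
import pandas as pd

# Made-up values with one obvious outlier.
alt = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

q1, q3 = alt.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits used by the conventional 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = alt[(alt < lower) | (alt > upper)]
print(lower, upper)
print(outliers)
```

Unlike the 5%-95% cut, which always flags roughly 10% of the rows, this rule can flag none at all when the data has no extreme values.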
In addition to what we have seen, pandas offers other functions to inspect the shape of our data and the values it contains.
In [20]:
print("-> Showing a sample of ten values")
airports.sample(n=10)
Out[20]:
In [21]:
print("-> Showing the ten highest airports")
airports.sort_values(by="alt",ascending=False)[:10]
Out[21]:
We can create new variables
In [22]:
airports.tz_db
Out[22]:
In [23]:
airports["continent"] = airports.tz_db.str.split("/").str[0]
airports.continent.unique()
Out[23]:
In [24]:
airports.continent.value_counts()
Out[24]:
In [25]:
(airports.continent.value_counts()/airports.continent.value_counts().sum())*100
Out[25]:
In [26]:
airports[airports.continent == "\\N"].shape
Out[26]:
In [27]:
airports.continent = airports.continent.replace('\\N',"unknown")
airports.tz_db = airports.tz_db.replace('\\N',"unknown")
airports.continent.unique()
Out[27]:
In [28]:
airports[airports.continent == "unknown"].head()
Out[28]:
We can assign a hemisphere to each airport.
In [29]:
hem_select = lambda x: "South" if x < 0 else "North"
airports["hemisphere"] = airports.lat.apply(hem_select)
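The same labelling can be done without a per-row Python call. A minimal sketch using numpy.where on made-up latitudes, with the equator counted as "North" to match the lambda above:

```python
import numpy as np
import pandas as pd

# Made-up latitudes, for illustration only.
pts = pd.DataFrame({"lat": [40.4, -33.9, 0.0, -0.1]})

# Vectorized equivalent of applying the lambda row by row.
pts["hemisphere"] = np.where(pts["lat"] < 0, "South", "North")
print(pts)
```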
We can calculate percentages.
In [30]:
(airports.hemisphere.value_counts() / airports.shape[0]) * 100
Out[30]:
In [31]:
(airports.continent.value_counts() / airports.shape[0]) * 100
Out[31]:
In [32]:
((airports.country.value_counts() / airports.shape[0]) * 100).sample(10)
Out[32]:
In [33]:
((airports.country.value_counts() / airports.shape[0]) * 100).head(10)
Out[33]:
In [34]:
type(airports.country.value_counts())
Out[34]:
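value_counts can also return proportions directly through its normalize parameter. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up labels, for illustration only.
hemi = pd.Series(["North", "North", "North", "South"])

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
pct = hemi.value_counts(normalize=True) * 100
print(pct)
```

One difference worth keeping in mind: value_counts drops NaN by default, so with missing values the denominator here is the number of non-missing rows, whereas dividing by shape[0] uses all rows.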
Let's transform alt into a qualitative (categorical) variable.
In [35]:
airports["alt_type"] = pd.cut(airports.alt,bins=3,labels=["low","med","high"])
In [36]:
airports.head()
Out[36]:
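With an integer bins argument, pd.cut splits the value range into equal-width intervals, so a few very high values can leave most rows in the first bin. pd.qcut instead puts roughly the same number of rows in each bin. A minimal sketch comparing the two on made-up altitudes:

```python
import pandas as pd

# Made-up altitudes with one extreme value.
alt = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 3000])
labels = ["low", "med", "high"]

equal_width = pd.cut(alt, bins=3, labels=labels)   # three intervals of equal width
equal_freq = pd.qcut(alt, q=3, labels=labels)      # three intervals of equal count

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

Which one is appropriate depends on whether the labels should describe absolute height or rank within the data.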
Let's group data:
In [37]:
airp_group = airports.groupby(["continent","alt_type"])
The groups attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group. In the example above we have:
In [38]:
airp_group.groups.keys()
Out[38]:
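To make the structure of groups concrete, here is a minimal sketch on a made-up four-row frame. With two grouping columns each key is a tuple, and each value holds the row labels of that group.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "alt_type": ["low", "high", "low", "low"],
})

groups = df.groupby(["continent", "alt_type"]).groups
for key, row_labels in groups.items():
    print(key, list(row_labels))
```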
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.
In [39]:
airp_group.size()
Out[39]:
In [67]:
# a dict of new names is no longer accepted here in pandas >= 1.0; pass a list of function names instead
airp_group["alt"].agg(["max","min","mean"]).head()
Out[67]:
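When the output columns need custom names, pandas (0.25 and later) supports named aggregation through keyword arguments. A minimal sketch on made-up data; the column names alt_max, alt_min and alt_mean are arbitrary choices for this example.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia", "Asia"],
    "alt": [100.0, 300.0, 50.0, 150.0, 250.0],
})

# Each keyword becomes an output column; the value names the aggregation.
summary = df.groupby("continent")["alt"].agg(
    alt_max="max",
    alt_min="min",
    alt_mean="mean",
)
print(summary)
```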
In [41]:
airports.alt.hist(bins=100)
Out[41]:
Pandas has a handy .unstack() method that moves the inner index level into columns. Use it to convert the results into a more readable format; the result can be stored in a new variable.
In [68]:
airp_group["alt"].sum().unstack()
Out[68]:
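A minimal sketch of what unstack does to a two-level grouped result, on made-up data. Combinations that never occur in the data become NaN in the wide table.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "alt_type": ["low", "high", "low", "low"],
    "alt": [100.0, 2000.0, 50.0, 150.0],
})

long = df.groupby(["continent", "alt_type"])["alt"].sum()  # MultiIndex Series
wide = long.unstack()  # alt_type values become the columns

print(long)
print(wide)
```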
Recall that we also saw how to build a pivot table:
In [69]:
airports.pivot_table(index="hemisphere",values="alt",aggfunc="mean")
Out[69]:
In [70]:
airports.groupby("hemisphere").alt.mean()
Out[70]:
In [46]:
my_df = pd.DataFrame(np.ones(100),columns=["y"])
my_df.head(10)
Out[46]:
In [44]:
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
plt.rcParams['figure.figsize'] = [10, 8]
In [47]:
my_df.plot()
Out[47]:
In [79]:
my_df["z"] = my_df.y.cumsum()
my_df.plot()
Out[79]:
In [80]:
my_df.y = my_df.z ** 2
my_df.plot()
Out[80]:
In [81]:
my_df.z = np.log(my_df.y)
my_df.z.plot()
Out[81]:
We can plot with different plot types:
* 'bar' or 'barh' for bar plots
* 'hist' for histograms
* 'box' for box plots
* 'kde' or 'density' for density plots
* 'area' for area plots
* 'scatter' for scatter plots
* 'hexbin' for hexagonal bin plots
* 'pie' for pie plots
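Each plot type can be requested either through the kind argument or through the matching accessor method. A minimal sketch on made-up counts, assuming a non-interactive backend so that it also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts, for illustration only.
counts = pd.Series({"Europe": 3, "Asia": 5, "Africa": 2})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# The two calls below produce the same kind of plot.
counts.plot(kind="barh", ax=ax_left, title="plot(kind='barh')")
counts.plot.barh(ax=ax_right, title="plot.barh()")
```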
In [48]:
airports.groupby("continent").size().plot.bar()
Out[48]:
In [83]:
airports.groupby("continent").alt.agg(["max","min","mean"]).plot(kind="bar")
Out[83]:
In [84]:
airports.groupby("continent").alt.agg(["max","min","mean"]).plot(kind="bar",stacked=True)
Out[84]:
In [85]:
airports.groupby("continent").alt.agg(["max","min","mean"]).plot(kind="barh",stacked=True)
Out[85]:
In [86]:
airports.alt.plot(kind="hist",bins=100)
Out[86]:
In [87]:
airports.loc[:,["alt"]].plot(kind="hist")
Out[87]:
In [88]:
airports.loc[:,["lat"]].plot(kind="hist",bins=100)
Out[88]:
In [89]:
airports.loc[:,["lon"]].plot(kind="hist",bins=100)
Out[89]:
In [90]:
airports.plot.box()
Out[90]:
In [91]:
airports.alt.plot.box()
Out[91]:
In [92]:
airports.pivot(columns="continent").alt.plot.box()
Out[92]:
In [93]:
sp_airp = airports[airports.country=="Spain"]
spain_alt = sp_airp.sort_values(by="alt").alt
spain_alt.index = range(spain_alt.size)
spain_alt.plot.area()
Out[93]:
In [94]:
airports.plot.scatter(y="lat",x="lon")
Out[94]:
In [95]:
airports.plot.scatter(y="lat",x="lon",c="alt")
Out[95]:
In [96]:
# marker sizes must be non-negative, so altitudes below sea level are clipped to 0
airports.plot.scatter(y="lat",x="lon",s=airports["alt"].clip(lower=0)/20)
Out[96]:
In [97]:
airports.plot.hexbin(x="lon",y="lat",C="alt",gridsize=20)
Out[97]:
In [98]:
airports.alt.plot.kde()
Out[98]:
In [79]:
airports.lat.plot.kde()
Out[79]:
In [99]:
airports.lon.plot.kde()
Out[99]:
The exercises are based on the 2016 New Coder Survey, a survey answered by 15000 coders that contains 46 questions (each question is a variable).
The data is available at https://github.com/FreeCodeCamp/2016-new-coder-survey/blob/master/data/2016-New-Coder-Survey-Data-Summary.csv
Using this dataset, please answer the following questions.
It is highly recommended that, instead of cleaning the whole dataset, you do an error and outlier analysis of each variable you are going to use before answering the question.
The variables that you need for each question are in the dataset; you only have to browse it and select the ones that correspond.
Show in a bar plot the top 10 nationalities with the most respondents.
Show in a bar plot the top 10 countries with the most respondents.
Do an outlier analysis of the ages. How many outliers are there using box-and-whiskers? How many using the 5%-95% range?
Draw a box plot for ages in the USA.
Show the average age per country. Which country has the oldest respondents? Which has the youngest?
Do an outlier analysis of the incomes. How many outliers are there using box-and-whiskers? How many using the 5%-95% range?
Draw a box plot for incomes in Spain.
What is the mean income? And the mean income per age? Plot an area plot. Split incomes into 4 ranges and plot a bar plot for the top ten respondent countries, with 4 bars counting how many people are in each range.
Do a density plot of the incomes.
Do a histogram of the incomes. Select a suitable number of bins so that the density plot and the histogram look similar.
Do a scatter plot of age and commute time, with income as a third variable.